Update all usages of fleettools to use the installed Agent ID #7054
base: main
Conversation
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
There was cloud instability while creating the deployment; restarting tests.
buildkite test this
pkg/testing/fixture.go (outdated)
	defer cancel()

	var lastErr error
	for {
I believe it is more appropriate for any retry logic to be left to the caller to implement if necessary; this function shouldn't implicitly keep retrying to get the status of an agent. wdyt? 🙂
That means everywhere in the testing framework we will have to add retry logic. I am not a fan of that; it will pollute the testing code. Since this is testing code, I thought it would be best to place the logic here.
I was thinking of splitting it into two functions, one with retry and one without, but I couldn't find a single place where I would prefer the function without retry over the function with retry.
Pollution 😄 I never thought of it like that; ok, I will keep it in mind next time.

> I couldn't find a single place where I would prefer the function with no retry over the function with retry.

To me that sounds like you would prefer to always call the one with retry, just to stay on the "safe" side; otherwise you wouldn't introduce it to begin with?!
Ok, let's do what you say then. If you need to retry because the tests currently lack a way to wait for the AgentID to become available, let's try to minimise the "pollution" with a separate call that at least allows the caller to specify the retry knobs, and call that from everywhere 😉
I also think that deciding whether being unable to get an agent ID is a showstopper, or whether retries should be performed (maybe using assert.Eventually() or some other assertion, as in the sketch below), should be up to the specific test.
Also, assuming that this can take up to 1 minute may be wrong depending on the test case.
I would prefer not to have "hidden" mechanisms in the utility functions; if we need to change the test code, so be it: explicit test-case code is preferable in my opinion.
One more thing: what exactly is the case where an installed and enrolled agent does not have an AgentID? Wouldn't that be an issue with either the enroll operation or the test structure?
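As an illustration of that caller-side approach, here is a minimal sketch using testify's require.Eventually (the require variant of the assert.Eventually suggested above); the agentIDer interface stands in for the test fixture and is an assumption, not the framework's actual API:

package example

import (
	"context"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

// agentIDer is a stand-in for the test fixture; AgentID is assumed to make a
// single attempt with no internal retries.
type agentIDer interface {
	AgentID(ctx context.Context) (string, error)
}

// waitForAgentID keeps the retry decision in the test: the timeout and tick
// are chosen at the call site rather than hidden inside the utility function.
func waitForAgentID(t *testing.T, f agentIDer) string {
	t.Helper()
	var agentID string
	require.Eventually(t, func() bool {
		id, err := f.AgentID(context.Background())
		if err != nil || id == "" {
			return false
		}
		agentID = id
		return true
	}, time.Minute, time.Second, "agent never reported an ID")
	return agentID
}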
I don't know why we would want the same retry logic in every test. Most developers want a DRY method of development, where the same code is not repeated everywhere. Not having it in the function also results in more cases of errors and flakiness in tests, if the developer doesn't add the extra code to ensure that retries are performed. Overall, retry logic in the function provides a cleaner implementation in the test code, which is where we should strive for improved readability, so I would prefer to see this type of logic placed in shared functions.
I have updated the code to allow the caller to disable retries and to adjust the timeout and interval as well (a sketch of that shape follows). I don't see the need for those honestly, but let's see if you like that better.
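A minimal sketch of that shape, using functional options; the Fixture type, agentIDOnce helper, defaults, and option names are illustrative assumptions, not the actual fixture API:

package example

import (
	"context"
	"errors"
	"time"
)

// Fixture and agentIDOnce stand in for the real test fixture; agentIDOnce is
// assumed to make a single status request and return the reported ID.
type Fixture struct{}

func (f *Fixture) agentIDOnce(ctx context.Context) (string, error) {
	return "", errors.New("not implemented in this sketch")
}

type agentIDOpts struct {
	retry    bool
	timeout  time.Duration
	interval time.Duration
}

type AgentIDOpt func(*agentIDOpts)

func WithoutRetry() AgentIDOpt                { return func(o *agentIDOpts) { o.retry = false } }
func WithTimeout(d time.Duration) AgentIDOpt  { return func(o *agentIDOpts) { o.timeout = d } }
func WithInterval(d time.Duration) AgentIDOpt { return func(o *agentIDOpts) { o.interval = d } }

// AgentID retries by default but lets the caller opt out or adjust the knobs.
func (f *Fixture) AgentID(ctx context.Context, opts ...AgentIDOpt) (string, error) {
	o := agentIDOpts{retry: true, timeout: time.Minute, interval: time.Second}
	for _, opt := range opts {
		opt(&o)
	}
	if !o.retry {
		return f.agentIDOnce(ctx)
	}
	ctx, cancel := context.WithTimeout(ctx, o.timeout)
	defer cancel()
	var lastErr error
	for {
		id, err := f.agentIDOnce(ctx)
		if err == nil && id != "" {
			return id, nil
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return "", errors.Join(ctx.Err(), lastErr)
		case <-time.After(o.interval):
		}
	}
}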
I would actually prefer to move this logic into ExecStatus, because that is where it really belongs. It is very much a retry on a failed connection to a remote GRPC server. It might even be better to place this directly in the elastic-agent status command, but that would not fix previous versions. Given that many of the tests install old versions and upgrade them to the latest, placing it in the command would not help in tests.
Alright, in that case I'm fine with it. Thanks for the explanation! It would be nice to add a comment to the empty-string check to make it clear it is a failsafe.
Incidentally, this isn't related to your change, but how does agent not having an ID but the control protocol server running actually happen?
I have done just that: moved the retry logic into ExecStatus, as that is the appropriate place for it. This is very much about retries for communication with the Elastic Agent daemon, which is a local GRPC server. (A sketch of that shape follows.)
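A minimal sketch of what that looks like; the Exec helper and AgentStatusOutput struct here are assumptions standing in for the framework's actual definitions:

package example

import (
	"context"
	"encoding/json"
	"fmt"
	"time"
)

// Stubs for illustration only; the real fixture, exec helper, and status
// struct live in the testing framework.
type Fixture struct{}

type AgentStatusOutput struct {
	Info struct {
		ID string `json:"id"`
	} `json:"info"`
}

func (f *Fixture) Exec(ctx context.Context, args ...string) ([]byte, error) {
	// Assumed to run the installed elastic-agent binary with the given args.
	return nil, fmt.Errorf("not implemented in this sketch")
}

// ExecStatus runs `elastic-agent status --output=json`, retrying while the
// daemon's local GRPC server is not yet accepting connections.
func (f *Fixture) ExecStatus(ctx context.Context) (AgentStatusOutput, error) {
	ctx, cancel := context.WithTimeout(ctx, time.Minute)
	defer cancel()
	var lastErr error
	for {
		out, err := f.Exec(ctx, "status", "--output=json")
		if err == nil {
			var status AgentStatusOutput
			if err = json.Unmarshal(out, &status); err == nil {
				return status, nil
			}
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return AgentStatusOutput{}, fmt.Errorf("ExecStatus never succeeded: %w (last error: %v)", ctx.Err(), lastErr)
		case <-time.After(time.Second):
		}
	}
}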
> Incidentally, this isn't related to your change, but how does agent not having an ID but the control protocol server running actually happen?

Honestly, I just think it could happen; I was just being defensive. The definition of quality software to me is how it handles the unknown. I don't actually know; maybe it is always set.
I have removed that for now. I guess we will see if it becomes an issue. If we start getting failures saying the Agent ID is empty, we will know.
> I have removed that for now. I guess we will see if it becomes an issue. If we start getting failures saying the Agent ID is empty, we will know.
I like that decision. If it can happen, I'd consider it a bug, so a test catching it would be a good thing.
This pull request is now in conflict. Could you fix it? 🙏
Until we discuss as a team what DRY is, when to apply it, and what developers of integration tests should aim for, and all of us collectively agree on a definition, I believe this PR should not be merged.
Happy to discuss.
	if err != nil {
		return "", err
	}
	return status.Info.ID, nil
This is possibly misleading, because even standalone agents have IDs that are then replaced with one generated by Fleet after enrollment succeeds.
For example, I see the following with a local standalone agent. Notice "is_managed": false there, but "id": "913ce739-2c6c-45e9-90f5-2226a14bca70" being populated.
sudo elastic-development-agent status --output=json
{
  "info": {
    "id": "913ce739-2c6c-45e9-90f5-2226a14bca70",
    "version": "9.1.0",
    "commit": "d2047ac48df2f4536ca69a86ad4922b3e264501a",
    "build_time": "2025-02-25 21:52:49 +0000 UTC",
    "snapshot": true,
    "pid": 70294,
    "unprivileged": false,
    "is_managed": false
  },
  "state": 2,
  "message": "Running",
  ...
}
Just looking at the ID at any one point in time is not going to give you a valid ID for making requests to Fleet.
We probably want an explicit entry in the status output for the ID as assigned by Fleet, so we can poll for it to be populated. Otherwise I worry there will be race conditions in tests where the standalone ID is sometimes picked up before it is replaced by the one assigned during enrollment (a sketch of that polling idea follows).
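To illustrate the concern, one possible shape (field names taken from the status output above; the execStatus callback is hypothetical) is to poll and only accept the ID once the agent reports itself as managed:

package example

import (
	"context"
	"encoding/json"
	"time"
)

type statusOutput struct {
	Info struct {
		ID        string `json:"id"`
		IsManaged bool   `json:"is_managed"`
	} `json:"info"`
}

// fleetAgentID polls the status output and only returns the ID once the agent
// reports is_managed: true, so a pre-enrollment standalone ID is never used.
func fleetAgentID(ctx context.Context, execStatus func(context.Context) ([]byte, error)) (string, error) {
	for {
		if out, err := execStatus(ctx); err == nil {
			var st statusOutput
			if json.Unmarshal(out, &st) == nil && st.Info.IsManaged && st.Info.ID != "" {
				return st.Info.ID, nil
			}
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(time.Second):
		}
	}
}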
It is not misleading, but it does greatly depend on when you ask for the AgentID: you must check after enrollment has occurred. You don't need to worry about it picking up the wrong ID, as long as you are calling it at the correct time. I think AgentID() is also useful in the standalone case, so I don't think checking for is_managed: true would be correct for this type of call.
💚 Build Succeeded
cc @blakerouse
What does this PR do?
Updates all integration tests to use the installed Elastic Agent ID from the status output to check with Fleet for information about the specific Elastic Agent.
Why is it important?
This ensures that the tests in the integration framework are only communicating with Fleet about that specific Elastic Agent. It removes the need to filter based on hostname or to do any paging with the Kibana API to find that specific Elastic Agent. Because the test installed the Elastic Agent, we know its ID, and the test should always use that ID (a sketch of such a lookup follows).
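For example, a hedged sketch of looking up one agent directly; Kibana's Fleet API serves agents under /api/fleet/agents, but the base URL, auth scheme, and response handling here are placeholders:

package example

import (
	"fmt"
	"net/http"
)

// getFleetAgent fetches one specific agent by its ID instead of listing and
// filtering agents by hostname.
func getFleetAgent(kibanaURL, apiKey, agentID string) (*http.Response, error) {
	url := fmt.Sprintf("%s/api/fleet/agents/%s", kibanaURL, agentID)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "ApiKey "+apiKey)
	req.Header.Set("kbn-xsrf", "true") // conventional Kibana API header
	return http.DefaultClient.Do(req)
}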
Checklist
[ ] I have made corresponding changes to the documentation
[ ] I have made corresponding changes to the default configuration files
[ ] I have added tests that prove my fix is effective or that my feature works (all integration tests)
[ ] I have added an entry in ./changelog/fragments using the changelog tool (testing only)
Disruptive User Impact
None
How to test this PR locally
mage integration:test
Related issues